arXiv:1504.07758v1 [cs.DB] 29 Apr 2015


Luzzu Quality Metric Language - A DSL for Linked 
Data Quality Assessment 


Jeremy Debattista, Christoph Lange, Sören Auer

University of Bonn & Fraunhofer IAIS

{debattis,langec,auer}@cs.uni-bonn.de 


Abstract. The steadily growing number of linked open datasets has brought about
a number of reservations amongst data consumers with regard to the datasets’
quality. Quality assessment requires significant effort and consideration, including
the definition of data quality metrics and a process to assess datasets based
on these definitions. Luzzu is a quality assessment framework for linked data
that allows domain-specific metrics to be plugged in. In this paper we describe
the Luzzu Quality Metric Language (LQML), a domain specific language (DSL)
whose purpose is to enable non-programming domain experts to define quality
metrics for the assessment of linked open datasets. LQML offers notations and
abstractions focused on the representation of quality metrics, together with the
expressive power needed for defining sophisticated quality metrics. Its
integration with Luzzu enables their efficient processing and execution and thus
the comprehensive assessment of extremely large datasets in a streaming way.
We also describe a novel ontology that enables the reuse, sharing and querying
of such definitions. Finally, we evaluate the proposed DSL against the cognitive
dimensions of notation framework.

Keywords: data quality, quality metrics, linked data, domain specific language 


1 Introduction 

In May 2007 the first version of the Linked Open Data cloud was published. Following
the Linked Data principles¹, 12 datasets were initially added to the LOD cloud.
The cloud has grown over the years, and the data provider² count is 570 (as
of August 2014). Furthermore, an investigation (from February 2015) of the data
catalogue portal datahub.io showed that 1,309 out of 9,262 datasets are tagged with the
lod or format-rdf tags. Moreover, other linked datasets might be available in the Web
of Data but are not catalogued, and thus are not easily discovered by data consumers.
Schmachtenberg et al. compiled a list of uncatalogued datasets from various W3C
mailing list communications and the 2012 Billion Triple Challenge.

However, the increase of linked open datasets brought about a number of reservations
amongst data consumers with regard to the datasets’ quality. Hitzler and Janowicz
state that linked open datasets have a reputation of being of poor quality.
Consequently, if we manage to systematically assess and improve the data quality of LOD,
many Linked Open Data applications can be better supported and contribute to
establishing the 4th Big Data aspect - Veracity.

¹ http://www.w3.org/DesignIssues/LinkedData.html

² Each data provider, identified by its pay-level domain (e.g. http://dbpedia.org), can
have more than one dataset; 1014 datasets were discovered in 2014.

Indubitably, quality assessment requires a lot of effort and consideration before
processing a dataset. Quality is commonly described as fitness for use. Although
domain experts can decide on a number of quality factors of a particular dataset, in the
end it is up to data consumers to see if a dataset is suitable for their use case or not.
Providing measures on the quality of a published linked open dataset should be part
of the publishing lifecycle. Various research works define a number of quality factors
pertinent to linked open datasets [2,10,12]. Zaveri et al. [20] provide an overall
systematic review of such quality metrics. Additionally, domain specific quality metrics are
required to have a more comprehensive view of the dataset’s quality. For a cultural
heritage dataset, for example, the ratio of resources being linked to an Integrated Authority
File (e.g. GND³) is of crucial importance.

Luzzu⁴ is an extensible quality assessment framework for linked open datasets.
Linked Data quality metrics can be added to Luzzu by various third parties (such as
programmers and data enthusiasts). Data scientists, whose spectrum
ranges from data publishers and consumers to domain experts and knowledge engineers,
are often not confident in programming with traditional third generation languages.
Nevertheless, they are considered to be the ideal drivers for defining domain specific
quality metrics, which can be used on linked open datasets.

The main contribution of this article is the definition and implementation of the
Luzzu Quality Metric Language (LQML), a domain specific language (DSL) that
enables declarative definition of quality metrics for Luzzu (cf. Section 2). LQML offers
notations, abstractions and expressive power, focusing on the representation of
quality metrics for Linked Dataset assessment. A particular challenge in the definition of
LQML was to balance between providing the expressive power for defining
sophisticated quality metrics on the one hand and ensuring their efficient processing and
execution on the other hand. As a result, our implementation enables the comprehensive
assessment of extremely large datasets with respect to many quality metrics in a
streaming way. LQML is designed in a way that metrics can be written by non-programmers
who are experts in the domain (cf. [8,13,18]). Hudak [13] suggests that DSLs have the
potential to improve productivity in the long run, and with LQML we aim to contribute
to overcoming one of the main problems of Linked Open Data - data quality.

We also define a new vocabulary to enable the reuse, sharing and querying of LQML
metrics in a semantic manner (cf. Section 3). The usability of LQML is
systematically assessed (cf. Section 4) against the “cognitive dimensions of notation” (CD)
evaluation framework. These dimensions provide a comprehensive view of how users
can manage and use a defined language. We also briefly outline the state of the art in
domain specific languages (cf. Section 5), and concluding remarks and an outlook on
future work are discussed in Section 6.


³ German “Gemeinsame Normdatei”; see http://www.dnb.de/EN/gnd
⁴ http://eis-bonn.github.io/Luzzu

2 Luzzu Quality Metric Language

The Luzzu Quality Metric Language (LQML) is a structural declarative language that
enables the definition of quality metrics (called blueprints) in Luzzu. Based on our
experience from the use cases of the DIACHRON FP7 EU project⁵, we anticipate
that most domain-specific quality metrics are very similar structure-wise, with minor
changes required only in the rules’ conditions.

2.1 Analysis 

Data quality assessment varies from one domain to another. Although there exist a
number of generic quality metrics as defined in [20], different domains might require the
assessment of different features. For example, where in geographical datasets the
properties geo:long and geo:lat are absolutely required for resources that are defined
as a place (such as country and city), these properties might be redundant in
health-oriented datasets. The idea of LQML is that data scientists can define various quality
metrics over a dataset (or a domain of datasets). These declarative definitions are
translated into Java byte-code (see Section 2.3) and integrated within the Luzzu framework.
For the proposed domain specific language, we identified a domain terminology based
on quality metrics required by pilot partners in the DIACHRON project.

Use case overview 

EBI - One of the services of the European Bioinformatics Institute (EBI⁶) is to provide
linked datasets to the scientific community, with their main development focusing
around the Experimental Factor Ontology (EFO). The EFO ontology is then used
to annotate data in databases at the EBI. EFO is an evolving ontology by nature and
concepts from external ontologies are constantly being added (or replaced) in the
EFO.

Data Publica - This French company⁷ provides a number of data services, which
include the management of the largest and most complete directory of electronic data
in France. This directory covers all data available in France (private and public),
annotating it with relevant metadata, and making it available to the public through
various means (search engines, visualisations, etc.).

Domain Terminology: A typical quality metric definition for linked open datasets consists
of a pattern matching condition (i.e. matching the subject (?s), predicate (?p),
object (?o), or a mixture of these three with possibly advanced inspection) and a
consequent action. This resembles the traditional if ... then statements of programming
languages. The full representation of an LQML metric definition is termed a blueprint.
The feature model in Figure 1 describes the features required to create a quality metric
blueprint. A blueprint description should have enough information to assess a dataset
based on the quality criteria (Pattern Matching Rules in Figure 1), and to enable the
semantic description (Semantic Representation) of the quality metadata for the criteria
in question. A description should also have a Human-Readable Description. This is
required since blueprints will be shared amongst different data scientists, and it
enables anyone to understand complex patterns and actions. All features are necessary
in order to create one blueprint.

Figure 1. Feature Model for blueprints.

⁵ http://diachron-fp7.eu
⁶ http://www.ebi.ac.uk
⁷ http://www.data-publica.com

In a more formal manner, let $r$ be the root feature of a blueprint, and let $f_1$, $f_2$, $f_3$
represent the Semantic Representation feature, the Human-Readable Description feature, and
the Pattern Matching Rules feature respectively, such that:

$$r \leftrightarrow f_1 \land f_2 \land f_3 \qquad (1)$$

meaning that $f_1$, $f_2$, $f_3$ are mandatory features of $r$.

Similarly,

$$f_1 \leftrightarrow f_4 \land f_5 \qquad (2)$$

where $f_4$ and $f_5$ represent the Metric Semantic Resource (URI) and Assessment Result
Representation features respectively;

$$f_2 \leftrightarrow f_6 \land f_7 \qquad (3)$$

where $f_6$ represents the Label feature and $f_7$ the Description feature;

$$f_3 \leftrightarrow f_8 \land f_9 \qquad (4)$$

where $f_8$ represents the Pattern feature and $f_9$ the Action on Match feature.
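The mandatory-feature constraints of Equations (1) to (4) can be read as a completeness check over the six leaf features. The following is a minimal sketch of that reading, assuming a blueprint is held as a plain dictionary; the key names and the functions `missing_features` and `is_complete` are hypothetical and are not part of Luzzu.

```python
# Hypothetical key names standing in for the leaf features f4..f9.
MANDATORY_FEATURES = {
    "metric_uri": "f4: Metric Semantic Resource (URI)",
    "result": "f5: Assessment Result Representation",
    "label": "f6: Label",
    "description": "f7: Description",
    "pattern": "f8: Pattern",
    "action": "f9: Action on Match",
}


def missing_features(blueprint):
    """Return the keys of mandatory leaf features that are absent or empty."""
    return [key for key in MANDATORY_FEATURES if not blueprint.get(key)]


def is_complete(blueprint):
    """A blueprint satisfies the root feature r only if f4..f9 are all present."""
    return not missing_features(blueprint)
```

Because every feature is mandatory, a single missing leaf (for instance the label) is enough to make the whole blueprint incomplete.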

2.2 Design 

Having identified the features for the proposed domain specific language, we here
concisely describe its design and its features. Mernik et al. describe a number of design
patterns, three based on language exploitation (designs based on existing languages)
and another one for language invention. Our proposed DSL is based on the language
invention design pattern, where we fuse a number of specific terms (such as typeof) from
the use cases available together with variable binding expressions used in the syntax of
SPARQL (i.e. ?s ?p ?o) to refer to specific elements in a triple.

Quality Metric (Blueprint) Structure: A blueprint definition of a metric starts with the
def keyword and has a rule semantics. Each blueprint consists of the three features
mentioned in Section 2.1.

Pattern Matching Rules Feature: Declarative patterns start with the keyword match
(Pattern feature). If a triple matches the given condition, the given action (Action on
Match feature) is triggered. Any input triple ($t_s$, $t_p$, $t_o$) is matched against the
conditions that follow the match keyword, enclosed in curly brackets ({ }). A rule can have one
or more conditions. Conditions can be connected via the logical and (&) operator or the
logical or (|) operator. Table 1 shows possible conditions.


Name                 Description

typeof(?s | ?o)      checks the type of the subject or object.
                     typeof(?s) == <U> translates to the triple pattern “?s a <U>”;
                     typeof(?o) == <U> translates to the triple pattern “?o a <U>”.

?s | ?p == <U>       matches the subject (?s) or the predicate (?p) against a given IRI (<U>).

?o == <U> | x        matches the object (?o) against a given IRI (<U>) or a literal (x).

Table 1. Pattern Matching Conditions


A condition can trigger one or more of the following actions:

- map(?s, ?o) adds the subject and the object to a hash map as key/value (where
the value is a list of objects);

- count increments a counter;

- unique(?s | ?p | ?o) increments a counter only when a value of ?s,
?p, or ?o is encountered for the first time.
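The interplay of a match condition and the three actions can be illustrated with a small streaming loop. This is only a sketch under stated assumptions: triples are plain Python tuples, the condition is an ordinary function, and the names `ActionState` and `run` are hypothetical. Luzzu itself compiles blueprints to Java classes, so this is not its implementation.

```python
from collections import defaultdict


class ActionState:
    """Accumulators for the three built-in actions (hypothetical names)."""

    def __init__(self):
        self.count = 0                    # value of the count action
        self.seen = set()                 # distinct terms seen by unique(...)
        self.mapping = defaultdict(list)  # subject -> list of objects for map(?s, ?o)

    def do_count(self):
        self.count += 1

    def do_unique(self, term):
        # Adding to a set only grows it for previously unseen terms.
        self.seen.add(term)

    def do_map(self, subject, obj):
        self.mapping[subject].append(obj)

    @property
    def unique(self):
        return len(self.seen)


def run(triples, condition, actions):
    """Stream over triples once and fire every action on each matching triple."""
    state = ActionState()
    for s, p, o in triples:
        if condition(s, p, o):
            for action in actions:
                action(state, s, p, o)
    return state
```

For example, with a condition that tests `p` against the rdfs:subClassOf IRI and the actions `count` and `unique(?s)`, two matching triples that share one subject give a count of 2 and a unique value of 1.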

In order not to limit LQML expressiveness, programmers can develop custom
functions which can then be imported into Luzzu. This enables data scientists to
automatically use the imported functions in their defined match pattern. Many domain
specific languages, including XPath (cf. Section 5), not only have built-in functions⁸
but also enable implementations to provide additional external functions.

Human Readable Descriptions: Descriptive human-readable comments are also
required in these blueprints. We provide the keywords label and description to
provide the metric’s name and its textual description; they translate to rdfs:label
and rdfs:comment.

⁸ http://www.w3.org/TR/xpath-30/#id-function-calls

Semantic Representation: The definition also expects other information that describes
a quality metric. The metric (Metric Semantic Resource (URI) feature) keyword
expects a quality metric resource URI. These resources are defined in a vocabulary that
extends the Dataset Quality Ontology (daQ). The finally (Assessment Result
Representation feature) keyword takes a defined function, and returns an output value, which
is used as the daQ observation value.

The finally keyword can have one of the following parameters:

- actionresult(x) takes the value of the action, where x stands for map, count,
or unique;

- ratio(x, y) takes two parameters, which can either be integer or float numbers,
or even a function (e.g. count) that returns a numeric value. The ratio function
divides x by y.
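The two finally parameters can be sketched as small helper functions. This is an illustrative reading of the description above, not Luzzu code: the action values are assumed to live in a dictionary keyed by action name, and raising an error for a zero denominator is an assumption, since the text does not say how that case is handled.

```python
def action_result(state, name):
    """Return the value accumulated by one action: 'map', 'count' or 'unique'."""
    if name not in ("map", "count", "unique"):
        raise ValueError(f"unknown action: {name}")
    return state[name]


def ratio(x, y):
    """Divide x by y; each argument is a number or a zero-argument function."""
    x_val = x() if callable(x) else x
    y_val = y() if callable(y) else y
    if y_val == 0:
        # Assumption: the source does not define the result for y == 0.
        raise ValueError("ratio is undefined when the denominator is 0")
    return x_val / y_val
```

With a state of `{"count": 6, "unique": 3}`, the expression `ratio(action_result(state, "count"), action_result(state, "unique"))` mirrors the finally clause used in the blueprint examples.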


2.3 Implementation 

The LQML grammar is implemented in JavaCC (Java Compiler Compiler)⁹. JavaCC is
a parser generator and a lexical analyser generator, where the grammar is specified in EBNF
notation. Blueprints defined in LQML are parsed by the JavaCC-generated parser, and each
blueprint is then transformed into a Java class during Luzzu’s runtime.

The following listing shows the EBNF grammar for the main parts of the LQML
syntax.

<Definition> ::= <Def> <Metric> <Label> <Description> <Match> <Action> <Finally>
<Def> ::= "def" <LBrace> <Strict_Str> <RBrace> <Colon>
<Metric> ::= "metric" <LBrace> <IRIref> <RBrace> <SemiColon>
<Match> ::= "match" <LBrace> (<Condition>)+ <RBrace>
<Condition> ::= <LParen> <TypeOf> | <DefinedFunction> | <other> <RParen>
    (<logical_operator>)*
<TypeOf> ::= "typeof" <LParen> "?s" <RParen> <boolean_operator> <IRIref>
<other> ::= <LParen> "?s" <boolean_operator> <IRIref> <RParen>
    | <LParen> "?p" <boolean_operator> <IRIref> <RParen>
    | <LParen> "?o" <boolean_operator> ( <IRIref> | <Quoted_Str> ) <RParen>
<DefinedFunction> ::= <Strict_Str> "(" ("?s" | "?p" | "?o")* ")"
<IRIref> ::= refer to RFC 3987
<Action> ::= "action" <LBrace> ((<Map> | <Count> | <Unique>) (",")* )+ <RBrace>
<Map> ::= "map" <LParen> ("?s" | "?p" | "?o") ("?s" | "?p" | "?o") <RParen>
<Count> ::= "count"
<Unique> ::= "unique" <LParen> ("?s" | "?p" | "?o") <RParen>
<Finally> ::= "finally" <LBrace> (<Number> | <ActionResult> | <Ratio>)+ <RBrace>
<ActionResult> ::= "actionresult" <LParen> ("map" | "count" | "unique") <RParen>
<Ratio> ::= "ratio" <LParen> (<Number> | <NumericFunction>) ","
    (<Number> | <NumericFunction>) <RParen>
<NumericFunction> ::= <Map> | <Count> | <Unique>

Listing 1. LQML EBNF grammar

⁹ https://java.net/projects/javacc


External Functions: External functions, defined as Java classes, are preloaded into
Luzzu beforehand. These can only be used within a match pattern. The structure (as
described in the EBNF <DefinedFunction>) requires a function name (as a string)
and zero or more variables.
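The idea of preloading named functions and calling them from a match pattern can be sketched with a simple registry. Everything here is hypothetical: in Luzzu the external functions are Java classes, whereas this sketch uses Python callables, and the names `external`, `call_external` and `isHttpIri` are invented for illustration.

```python
# Registry of preloaded external functions, keyed by the name used in a pattern.
EXTERNAL_FUNCTIONS = {}


def external(name):
    """Decorator that registers a function under the given pattern name."""
    def register(func):
        EXTERNAL_FUNCTIONS[name] = func
        return func
    return register


@external("isHttpIri")
def is_http_iri(term):
    """Example external function: does the term look like an HTTP(S) IRI?"""
    return term.startswith("http://") or term.startswith("https://")


def call_external(name, triple, variables):
    """Evaluate name(?x, ...) against one triple; variables may be empty."""
    s, p, o = triple
    bindings = {"?s": s, "?p": p, "?o": o}
    func = EXTERNAL_FUNCTIONS[name]
    return bool(func(*(bindings[var] for var in variables)))
```

A pattern such as `isHttpIri(?s)` would then correspond to `call_external("isHttpIri", triple, ["?s"])`, matching the grammar rule of a function name followed by zero or more variables.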


2.4 Blueprint Examples 


In order to maintain the quality of the EFO, domain experts from the EBI (cf.
Section 2.1) defined relevant quality metrics. One such metric measures the
percentage of resources that are actually defined as sub-classes (rdfs:subClassOf)
of other classes. Listing 2 shows an LQML metric definition for the above.


def{ SubClassCounter }:
metric{<http://www.example.org/ebiqm#SubclassCountingMetric>};
label{"SubClassCountingMetric"};
description{"Provides a measure for counting the number of resources that
    are defined as sub-classes"};
match{ (?p == <http://www.w3.org/2000/01/rdf-schema#subClassOf>) };
action{count, unique(?s)};
finally{ratio(actionresult(count), actionresult(unique))}.


Listing 2. EBI Use Case Example in LQML
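Since a blueprint is a fixed sequence of keyword sections, a quick structural check can catch a misplaced section before the blueprint reaches the real parser. The following is a rough sketch only: it uses a regular expression rather than the JavaCC grammar, it would be confused by keywords followed by a brace inside quoted strings, and the function names are hypothetical.

```python
import re

# Section keywords in the order used by the blueprint in Listing 2.
EXPECTED_ORDER = ["def", "metric", "label", "description", "match", "action", "finally"]

# A section starts with a keyword followed (optionally after spaces) by "{".
_SECTION = re.compile(r"\b(def|metric|label|description|match|action|finally)\s*\{")


def section_order(blueprint_text):
    """Return the section keywords in the order they appear in the text."""
    return _SECTION.findall(blueprint_text)


def is_well_ordered(blueprint_text):
    """True if every section appears exactly once and in the expected order."""
    return section_order(blueprint_text) == EXPECTED_ORDER
```

Applied to the text of Listing 2, `section_order` yields the seven keywords in their fixed order; swapping two sections makes `is_well_ordered` return False.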


One of Data Publica’s requirements is that each resource they define has a
human-readable description or label. This means that they quantify the percentage of
resources that have either an rdfs:label or an rdfs:comment defined. This metric is
defined in LQML in Listing 3.


def{HumanReadableLabel}:
metric{<http://www.example.org/dpqm#SubClassCountingMetric>};
label{"Human Readable Labelling Metric"};
description{"Provides a measure for identifying the ratio of human readable
    labels of defined resources in a dataset"};
match{((typeof(?s) == <http://www.example.org/dp#Class>) && ((?p == <http://
    www.w3.org/2000/01/rdf-schema#label>) || (?p == <http://www.w3.org
    /2000/01/rdf-schema#comment>)))};
action{count, unique(?s)};
finally{ratio(actionresult(count), actionresult(unique))}.


Listing 3. Data Publica Use Case Example in LQML

Figure 2. Luzzu Blueprint Ontology.


3 Sharing Blueprints - Luzzu Blueprint Ontology 

We envisage that blueprint descriptions can be stored and shared in a common pool of
metrics, similar to how different users of the IFTTT service¹⁰ can share rule recipes.
Shared blueprints can be either reused or modified to fit the purpose of another use
case. The LQML language itself does not enable such sharing to be done with ease.
For this purpose, we propose an ontology, the Luzzu Blueprint Ontology (prefix: lbo),
which facilitates the semantic representation of LQML blueprints. Exploiting this
option, the ontology enables us to distribute blueprints as semantic resources. Moreover,
these semantic resources are easily queried and visualised.

In line with the Semantic Web principles, the Luzzu Blueprint Ontology (depicted
in Figure 2) reuses known concepts and domain specific ontologies.

Listing 4 shows an RDF definition of the root feature (cf. Figure 1), Blueprint
(defined as lbo:Blueprint in the proposed ontology).

lbo:Blueprint a rdfs:Class .

Listing 4. Defining the root feature r in Turtle notation


¹⁰ If This Then That is an online service allowing users to create simple rules that trigger
events: https://ifttt.com/

The vocabulary incorporates the three main features described in Section 2.1:
Semantic Representation (highlighted in light red - top right), Human Readable
Descriptions (highlighted in light green - top left), and Pattern Matching Rules
(highlighted in light blue - bottom part). Therefore, the formal definitions described in
Equations 1, 2, 3 and 4 are given a semantic RDF definition with the Luzzu Blueprint
Ontology.

For the human readable descriptions, we make use of the standard rdfs:label
and rdfs:comment properties to represent the blueprint’s label and description
respectively. According to the W3C RDF Schema definition of rdfs:label and rdfs:comment,
these two properties are used to provide a human-readable name and description of a
resource respectively. Both properties must have an rdfs:Resource as a domain
(which instances of lbo:Blueprint automatically are) and a literal (rdfs:Literal)
as its range, i.e. a textual value. These two properties facilitate the Semantic Web
notation for Equation 3.

The semantic representation (cf. Listing 5) of a quality metric is represented in an
instance of the blueprint ontology using the proposed properties lbo:relatedTo
and lbo:hasResult. The former links a blueprint instance to a daq:Metric
resource. The latter represents the resulting (finally in terms of LQML) output. For
this purpose we also introduced two sub-classes of the concept lbo:OutputResult:
lbo:Ratio and lbo:ActionResults. The semantics for the latter sub-classes
were discussed in Section 2.2. An lbo:OutputResult can have one or more
output parameters. The proposed property lbo:hasOutputParameters expects a
resource of type rdf:List as its range.


# Representing the Metric Semantic Resource (URI) feature
lbo:relatedTo a rdf:Property ;
    rdfs:domain lbo:Blueprint ;
    rdfs:range daq:Metric .

# Representing the Assessment Result Representation feature
lbo:hasResult a rdf:Property ;
    rdfs:domain lbo:Blueprint ;
    rdfs:range lbo:OutputResult .

# ... definitions of OutputResult, subclasses and hasOutputParameters property.

Listing 5. Defining the Semantic Representation feature (Equation 2) using a Semantic Web
notation


The Digital.me Rule Management ontology (DRMO) enables users to express rules
in terms of resources and concepts that are available in a personal knowledge base.
Although it is expected to operate in a closed-world environment, its flexibility as
a domain-independent ontology enables its reuse in other scenarios. The DRMO
concepts are inspired by the Event-Condition-Action (ECA) pattern, which
is used in event-driven architectures. For the pattern matching rules feature,
we make use of the DRMO concepts, creating a sub-class of drmo:Condition,
lbo:TypeOf, and three sub-classes of drmo:Action, namely lbo:Map, lbo:Count,
and lbo:Unique. The representation of a condition (the Pattern feature, $f_8$) is
left intact. On the other hand, we extended the concept drmo:Action (the Action
on Match feature, $f_9$) with a property named lbo:hasParameters, where the
range of the newly proposed property is a resource of type rdf:List. The property
lbo:hasPatternMatchingRule (cf. Listing 6) defines the described extended
drmo:Rule of an lbo:Blueprint. The semantics of the drmo:Rule concept,
together with the properties drmo:isComposedOf and drmo:triggers, gives a
semantic definition for Equation 4.

# Representing the relationship between the root feature (r) and the Pattern
# Matching Rules feature f3
lbo:hasPatternMatchingRule a rdf:Property ;
    rdfs:domain lbo:Blueprint ;
    rdfs:range drmo:Rule .

# ... definitions of subclasses for Action (Map, Count, Unique), and
# hasParameters extension to DRMO.

Listing 6. Defining the Pattern Matching Rules feature $f_3$

Debattista et al. provide a mechanism that transforms drmo:Rules into SPARQL
queries. Therefore, blueprint patterns can be transformed into SPARQL queries and
reused even within quality frameworks, such as RDFUnit, that employ a SPARQL
engine as their main assessment tool. This makes Luzzu blueprints interoperable with
other quality assessment frameworks.
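To make the idea of turning a pattern into a SPARQL query concrete, the sketch below builds a query string from a conjunction of simple conditions. It is not the mechanism of Debattista et al.; the tuple encoding of conditions and the function `to_sparql` are assumptions made for illustration, and only the typeof translation ("?s a <U>") is taken from Table 1.

```python
def to_sparql(conditions):
    """Build a SPARQL SELECT query from a conjunction of simple conditions.

    Each condition is a tuple (kind, variable, iri), where kind is either
    "typeof" (rendered as the triple pattern "?x a <iri>") or "eq"
    (rendered as a FILTER comparing the variable with the IRI).
    """
    patterns = ["?s ?p ?o ."]
    for kind, var, iri in conditions:
        if kind == "typeof":
            patterns.append(f"{var} a <{iri}> .")
        elif kind == "eq":
            patterns.append(f"FILTER({var} = <{iri}>)")
        else:
            raise ValueError(f"unsupported condition: {kind}")
    body = "\n  ".join(patterns)
    return f"SELECT ?s ?p ?o WHERE {{\n  {body}\n}}"
```

Disjunctions, literals and external functions are left out on purpose; a fuller translation would need UNION or a combined FILTER expression for the logical or operator.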


4 Evaluation 

In order to assess the usability of LQML, we gauge the language systematically against
the “cognitive dimensions of notation” (CD) evaluation framework. This evaluation
framework has previously been applied to Semantic Web languages.
These dimensions provide a comprehensive view of how users
can manage and use a defined language. Each dimension describes a specific aspect in
relation to the language notation. Blackwell and Green describe the following 13
dimensions:

Viscosity questions the effort required by the user to carry out a change.
Assessment: LQML metrics can be defined using a simple text editor. Each statement
is defined for a particular definition (blueprint) and is not related to other definitions.
Therefore, changing a statement in a definition does not require a change in any other
place, thus resulting in a low viscosity.

Premature Commitment measures any planning required before carrying out a task.
Assessment: Based on declarative programming, LQML users are only required to define
rules based on the patterns they want to match. Also, declarations are not required
before a blueprint definition. The only premature commitment is that metrics have to be
defined in an ontology (whose URI is defined in the blueprint definition) based on the
daQ ontology.

Hidden Dependencies measures if dependencies are specifically indicated in all
existing directions.
Assessment: Blueprint definitions cannot be connected to each other; therefore each
definition has a fixed rule and action, together with other descriptions, and no
dependencies between definitions can remain hidden.

Error-proneness measures the possibility of users making mistakes while using the
language.
Assessment: A definition is made up of only six components. This means the learning
curve is not too steep. However, since these six components must always appear in
the same order, i.e. metric, label, description, match, action, finally,
there is an increased possibility of the user making a mistake, but this is mitigated by
error messages from the LQML parser.

Abstraction measures high level concepts which are not easily grasped by the users,
since they do not refer to concrete instances. This dimension thus measures the
language’s abstraction level.
Assessment: In LQML, we try to keep the number of keywords at the pattern matching
feature to a minimum, such that users can have full control over their declarative patterns.
In this way there is a very low level of abstraction.

Secondary Notation indicates the availability of options for encoding extra context
information within the syntax itself, such as comments.
Assessment: A definition requires a description; further important information can be
added in an unstructured way as comments (starting with #, extending to the end of the
line).

Closeness of Mapping measures the degree of similarity between the representation
language and the real-world domain.
Assessment: Our aim is to simplify the definition of metrics as much as possible,
keeping in mind that users who are not Java experts may be using this tool. Although this
widens the tool’s audience, expert users who need to create
more complex metrics, for example, calculating the response time of a server serving
a resource, must implement LQML extension functions in Java; metrics with complex
matching conditions and actions will even have to be implemented completely in Java.

Consistency measures the usability of the language; in other words, how easy it is
for a user to write similar blueprints once the notation pattern has been learned.
Assessment: Unlike in the error-proneness dimension, we here consider that the fixed
syntax structure is actually a feature, in that consistency is kept for all blueprint
definitions.

Diffuseness measures the space required by the notation; i.e. the amount of
workspace occupied by the language.
Assessment: Although the blueprints themselves have a clear goal, the rules within the
definition might be messy and unclear since different conditions have to be defined in
brackets. In LQML, users have to define the precedence of evaluating the conditions
(using brackets). The fact that LQML blueprints are defined in a simple text editor
means that users might find some difficulty in understanding a rule.

Progressive Evaluation measures the understandability of the language even for a
solution that is incomplete. The possibility to try out a partial solution helps users in
further understanding their work.
Assessment: It is possible to incrementally refine definitions by, e.g., starting with a
partial match and a simple ‘count’ action, and then to further refine the matching pattern
by adding conditions, and to define a more complex action.

Role Expressiveness indicates the language’s notation and its expressiveness
vis-à-vis the whole solution.
Assessment: Our tool is aimed towards the definition of quality metrics for linked data.
In a definition, all required information is adequately labelled to enable easy
identification.

Visibility measures the degree of visibility of the language’s notation. If concepts
are encapsulated into concepts of a more abstract level, this reduces the visibility of the
notation.
Assessment: All available notation is directly visible to the user.

Provisionality measures the ability of the language to allow users to explore
potential options.
Assessment: Similarly to the secondary notation dimension, potential options can be
explored by temporarily commenting out parts of a definition.

Together the assessment w.r.t. these dimensions provides a comprehensive heuristic
guide of LQML, particularly focusing on language features that have not been
implemented in an immediate response to the given quality assessment requirements.
From this evaluation we can identify certain problems in the current implementation
of the syntax, such as the lack of support for reusing components of blueprints within
others. These heuristics also stress the need for a better presentation
tool (graphical interface) for the user, while also highlighting that whilst we are
widening the scope of metric definition for non-Java experts, we are limiting ourselves
to simple pattern matching metrics and thus more complex metrics cannot be defined.
These measurements will help us in the second phase of the language definition.


5 State of the Art 

A Domain Specific Language (DSL) is a small declarative programming language
focusing on a particular domain, offering appropriate notations and abstractions, in a way
that is easy to use for non-programmers [8,13,18]. The authors of [8,13] describe a
number of benefits of DSLs, including:

- the enhancement of productivity; 

- the incorporation of domain knowledge; 

- the possibility of portability; 

- the understandability of declarative programs by domain experts themselves; 

- easily maintainable code. 

A DSL development methodology starts with the decision stage, where stakeholders
decide if the effort of investing in a new DSL pays off in the future. If the stakeholders
decide to go ahead, they proceed to the analysis stage, where the problem domain
is identified and knowledge about that domain is gathered. Following that, the DSL
is designed: the knowledge is concisely described as semantic notations and
graphically by using tools such as a feature model¹¹. Finally, the DSL is implemented. In
the article “When and How to Develop Domain-Specific Languages”, Mernik et al.
provide the reader with a comprehensive insight into DSL development methodologies,
by identifying patterns for the four stages of the development methodology.

¹¹ http://en.wikipedia.org/wiki/Feature_model

In this section we mention a few examples from a growing list of domain specific 
languages. Each DSL builds on a data model to encapsulate domain knowledge into an 
abstract notation. 

Domain specific languages are popular within various applications. LaTeX is a
document preparation typesetting system usually used for technical and scientific
publications. HTML and XML are generic markup languages that are also DSLs. The former
is used to generate websites, whilst the latter is used as an interoperable data model.
XPath¹² is an expression language enabling the processing of values in an XML data
model. XPath uses path expressions to navigate through XML. RuleML¹³ is an XML
markup language that allows rules to be defined using a formal notation.

In the Semantic Web there are a number of domain specific languages. The RDF
data model¹⁴ is based on triple statements (subject-predicate-object) that enable the
description of real-world objects as machine-readable semantic resources. On top of
this data model, RDF Schema¹⁵ is a vocabulary that provides a number of classes and
properties to describe a resource in RDF. This schema language also provides the basic
concepts for the development of new domain specific ontologies. The RDF data model
can be serialised in different formats, such as RDF/XML and Turtle. SPARQL¹⁶ is a
domain specific query language for querying the RDF data model. The Web Ontology
Language (OWL¹⁷) adds more semantics on top of RDF Schema. OWL enables
users to create inferencing rules and statements on RDF. Going further away from the
data model, the Rule Interchange Format (RIF) is a web standard defining an
interchange language for rules, with the aim of achieving interoperability between
different systems. It focuses on the definition of various dialects, which enable the
exchange of rules between systems. The above mentioned DSLs are just a few examples
tackling different aspects of the RDF data model. Similar to these, the proposed Luzzu
Quality Metric Language is also based on this semantic data model.


6 Concluding Remarks 

Data quality assessment is crucial for the wider deployment and use of Linked Data.
With the Luzzu Quality Metric Language we empower domain experts who are not
proficient in using third generation programming languages to define domain specific
quality metrics for linked open datasets. We defined the Luzzu Blueprint Ontology to
ensure that quality metrics defined with our proposed domain specific language can
be shared, queried and reused easily in a semantic manner. LQML was evaluated
systematically against the Cognitive Dimensions of Notation, a methodology developed
purposely to assess formal notations such as those of programming languages. The
evaluation pointed out shortcomings in the current implementation of the DSL and possible
future improvements.

¹² http://www.w3.org/TR/xpath-30/
¹³ http://ruleml.org
¹⁴ http://www.w3.org/TR/rdf11-primer/
¹⁵ http://www.w3.org/TR/rdf-schema/
¹⁶ http://www.w3.org/TR/rdf-sparql-query/
¹⁷ http://www.w3.org/TR/owl2-overview/

Together with the Luzzu framework, we see this as the first step to break the poor
quality reputation barrier of Linked Open Datasets. Regarding future work, we aim
to create an interactive user interface for the definition of LQML metrics in order to
visualise and author metrics, and to offer an online pool of quality metrics that can be
queried and downloaded into Luzzu, and to which the Linked Open Data Community
can contribute. We also plan to refine LQML with more generic keywords that can be
used within the match, action, and finally parts of a definition.